# Import packages
import pandas as pd
import yfinance as yf
import pandas_datareader as pdr
L15: Linear regression intro
Preliminaries
Load Fama-French factor data:
ff3f = pdr.DataReader('F-F_Research_Data_Factors', 'famafrench', '2012-01-01')[0]/100
ff3f.head(2)
| Date | Mkt-RF | SMB | HML | RF |
|---|---|---|---|---|
| 2012-01 | 0.0505 | 0.0203 | -0.0097 | 0.0 |
| 2012-02 | 0.0442 | -0.0185 | 0.0043 | 0.0 |
Download monthly prices (keep only Adjusted Close prices):
firm_prices = yf.download('TSLA', '2012-12-01', '2020-12-31', interval = '1mo')['Adj Close'].dropna().to_frame()
firm_prices.head(2)
[*********************100%***********************] 1 of 1 completed
| Date | Adj Close |
|---|---|
| 2012-12-01 | 2.258000 |
| 2013-01-01 | 2.500667 |
Calculate monthly returns, drop missing and rename “Adj Close” to “TSLA”:
firm_ret = firm_prices.pct_change().dropna().rename(columns = {'Adj Close': 'TSLA'})
firm_ret.head(2)
| Date | TSLA |
|---|---|
| 2013-01-01 | 0.107470 |
| 2013-02-01 | -0.071448 |
We need to merge firm_ret with ff3f, but note that their dates look different. Check their formats first:
firm_ret.index.dtype
dtype('<M8[ns]')
ff3f.index.dtype
period[M]
Convert the index of firm_ret to a monthly period, to match the date format in ff3f:
firm_ret.index = firm_ret.index.to_period('M')
firm_ret.head(2)
| Date | TSLA |
|---|---|
| 2013-01 | 0.107470 |
| 2013-02 | -0.071448 |
Merge the two datasets:
data = firm_ret.join(ff3f)
data.head(2)
| Date | TSLA | Mkt-RF | SMB | HML | RF |
|---|---|---|---|---|---|
| 2013-01 | 0.107470 | 0.0557 | 0.0033 | 0.0096 | 0.0 |
| 2013-02 | -0.071448 | 0.0129 | -0.0028 | 0.0011 | 0.0 |
# Add a column of ones that will serve as the regression intercept (constant) term
data['const'] = 1
data
| Date | TSLA | Mkt-RF | SMB | HML | RF | const |
|---|---|---|---|---|---|---|
| 2013-01 | 0.107470 | 0.0557 | 0.0033 | 0.0096 | 0.0000 | 1 |
| 2013-02 | -0.071448 | 0.0129 | -0.0028 | 0.0011 | 0.0000 | 1 |
| 2013-03 | 0.087855 | 0.0403 | 0.0081 | -0.0019 | 0.0000 | 1 |
| 2013-04 | 0.424914 | 0.0155 | -0.0236 | 0.0045 | 0.0000 | 1 |
| 2013-05 | 0.810706 | 0.0280 | 0.0173 | 0.0263 | 0.0000 | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 2020-08 | 0.741452 | 0.0763 | -0.0022 | -0.0296 | 0.0001 | 1 |
| 2020-09 | -0.139087 | -0.0363 | 0.0004 | -0.0268 | 0.0001 | 1 |
| 2020-10 | -0.095499 | -0.0210 | 0.0437 | 0.0422 | 0.0001 | 1 |
| 2020-11 | 0.462736 | 0.1247 | 0.0581 | 0.0213 | 0.0001 | 1 |
| 2020-12 | 0.243252 | 0.0463 | 0.0489 | -0.0150 | 0.0001 | 1 |

96 rows × 6 columns
Linear regression basics
A linear regression is a statistical model, which means it is a set of assumptions about the relation between two or more variables. In particular, the standard linear regression assumptions are (we restrict ourselves to two variables X and Y for now):
A1. Linearity
The relation between the variables is assumed to be linear in parameters:
\[Y_t = \alpha + \beta \cdot X_t + \epsilon_t \]
Note that “linear in parameters” means the function that describes the relation between X and Y (the equation above) is linear with respect to \(\alpha\) and \(\beta\) (e.g. \(Y = \alpha \cdot X^{\beta} + \epsilon\) is not linear in parameters). It does not mean that the relation needs to be linear with respect to X (e.g. \(Y = \alpha + \beta \cdot X^2 + \epsilon\) is still linear in parameters).
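As a quick illustration (a minimal sketch with simulated data, not part of the lecture data), a model that is non-linear in X but linear in parameters can still be estimated by least squares after transforming the regressor:

# A minimal sketch with simulated data: the model Y = alpha + beta * X^2 + eps
# is non-linear in X but linear in parameters, so it can be estimated by
# least squares after transforming the regressor.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 + 2.0 * x**2 + rng.normal(size=200)

X = np.column_stack([np.ones_like(x), x**2])   # constant column and X^2
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)                                    # approximately [0.5, 2.0]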
Before we cover the remaining assumptions, a bit of terminology:
Y is commonly referred to as the “dependent”, or “explained” or “endogenous” variable
X is commonly referred to as the “independent”, or “explanatory”, or “exogenous” variable (though remember that X can stand for more than one variable)
\(\epsilon\) is commonly referred to as the “residual” of the regression, or the “error” term, or the “disturbance” term
\(\alpha\) (alpha) and \(\beta\) (beta) are the “coefficients” or “parameters” of the regression. The outcome of “running a regression” is to calculate estimates for these alpha and beta coefficients.
the t subscript is meant to represent the fact that we observe multiple realizations of the X and Y variables, and the linear relation is assumed to hold for each set of realizations (different t’s can represent different points in time, or different firms, different countries, etc). Going forward in this section, we will assume that t stands for time, to make the interpretation clearer.
A2. Mean independence
This assumption states that the independent variable(s) X convey no information about the disturbance terms (\(\epsilon\)’s). Technically, we write this assumption as:
\[E[\epsilon_t | X] = 0\]
This is also called the “strict exogeneity” assumption. When this condition is not satisfied, we say that our regression model has an endogeneity problem.
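To see what this assumption buys us: by the law of iterated expectations, mean independence implies that the residuals have mean zero and are uncorrelated with the explanatory variable(s) at all points in time:

\[E[\epsilon_t] = E\big[E[\epsilon_t | X]\big] = 0 \qquad \text{and} \qquad Cov(X_s, \epsilon_t) = 0 \ \text{for all } s, t\]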
A4. Full rank
This assumption states that there are no exact linear relationships between the explanatory variables X (when there are two or more such variables). When this assumption is not satisfied, we say we have a multicollinearity problem.
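As a minimal sketch (with made-up numbers, not the lecture data), an exact linear relationship between regressors makes the design matrix rank-deficient, so the OLS coefficients are no longer uniquely identified:

# Hypothetical example of perfect multicollinearity: x2 is exactly 2 * x1
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = 2 * x1
X  = np.column_stack([np.ones_like(x1), x1, x2])

# The design matrix has 3 columns but only rank 2, so X'X is singular
print(np.linalg.matrix_rank(X))   # prints 2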
In the next few lectures, we will cover strategies that we can use when some of the above assumptions are not satisfied.
Regression fitting: ordinary least squares (OLS)
By far the most common method for estimating linear regression coefficients is by minimizing the sum of the squares of the error terms (hence “least squares”).
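In the two-variable model from assumption A1, this means choosing the estimates \(\hat{\alpha}\) and \(\hat{\beta}\) that minimize \(\sum_t (Y_t - \alpha - \beta X_t)^2\), which yields the familiar closed-form solution:

\[\hat{\beta} = \frac{Cov(X, Y)}{Var(X)}, \qquad \hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X}\]

where \(Cov\), \(Var\), \(\bar{X}\) and \(\bar{Y}\) denote sample moments.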
The package we will use for linear regression fitting is called “statsmodels”. Install this package by typing the following in a terminal (or Anaconda Prompt):
pip install statsmodels
In fact, for the most part, we will only use the “api” subpackage of “statsmodels” as below. Here is the official documentation for the package if you want to learn more about it:
https://www.statsmodels.org/stable/index.html
import statsmodels.api as sm
As mentioned above, we will estimate our regression coefficients using OLS (ordinary least squares). This can be done with the OLS function of statsmodels:
Syntax:
class statsmodels.regression.linear_model.OLS(endog, exog=None, missing='none', hasconst=None, **kwargs)
When we use this function, we can replace statsmodels.regression.linear_model with sm (as imported above). The endog parameter is where we specify the data for our dependent variable, and the exog parameter is where we specify our independent variables. We usually set missing='drop' to tell Python that we want to get rid of any rows in our regression data that have missing values.
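As a quick sketch of this call pattern (using the data DataFrame built above; the full worked example follows in the next section), note that statsmodels also provides sm.add_constant() as an alternative to creating the 'const' column by hand:

# Sketch of the typical OLS call; sm.add_constant() adds the column of ones
# for the intercept (equivalent to the 'const' column we created earlier)
y = data['TSLA'] - data['RF']               # dependent variable (excess return)
X = sm.add_constant(data[['Mkt-RF']])       # regressors plus a constant column
res_sketch = sm.OLS(endog=y, exog=X, missing='drop').fit()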
Example 1: estimating a stock’s alpha and beta using the market model
The market model (aka the “single-factor model” or the “single-index model”) is a linear regression model that relates the excess return on a stock to the excess returns on the market portfolio:
\[R_{i,t} - R_{f,t} = \alpha_i + \beta_i (R_{m,t} - R_{f,t}) + \epsilon_{i,t}\]
where:

- \(R_{i,t}\) is the return of firm \(i\) at time \(t\)
- \(R_{m,t}\) is the return of the market at time \(t\) (we generally use the S&P 500 index as the market portfolio)
- \(R_{f,t}\) is the risk-free rate at time \(t\) (most commonly the yield on the 1-month T-bill)
Below, we estimate this model for TSLA, using the data we gathered at the top of these lecture notes:
To “run” (i.e. “fit” or “estimate”) the regression, we use the .fit() function, which can be applied after the sm.OLS() function. We store the results in “res”:
res = sm.OLS(endog = data['TSLA']-data['RF'],
             exog = data[['const','Mkt-RF']],
             missing = 'drop'
             ).fit()
res
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7fb7bb37f310>
The above shows that res is a “RegressionResultsWrapper”. We have not seen this kind of object before. Check all the attributes of the results (res) object:
print(dir(res))
['HC0_se', 'HC1_se', 'HC2_se', 'HC3_se', '_HCCM', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abat_diagonal', '_cache', '_data_attr', '_data_in_cache', '_get_robustcov_results', '_is_nested', '_use_t', '_wexog_singular_values', 'aic', 'bic', 'bse', 'centered_tss', 'compare_f_test', 'compare_lm_test', 'compare_lr_test', 'condition_number', 'conf_int', 'conf_int_el', 'cov_HC0', 'cov_HC1', 'cov_HC2', 'cov_HC3', 'cov_kwds', 'cov_params', 'cov_type', 'df_model', 'df_resid', 'eigenvals', 'el_test', 'ess', 'f_pvalue', 'f_test', 'fittedvalues', 'fvalue', 'get_influence', 'get_prediction', 'get_robustcov_results', 'info_criteria', 'initialize', 'k_constant', 'llf', 'load', 'model', 'mse_model', 'mse_resid', 'mse_total', 'nobs', 'normalized_cov_params', 'outlier_test', 'params', 'predict', 'pvalues', 'remove_data', 'resid', 'resid_pearson', 'rsquared', 'rsquared_adj', 'save', 'scale', 'ssr', 'summary', 'summary2', 't_test', 't_test_pairwise', 'tvalues', 'uncentered_tss', 'use_t', 'wald_test', 'wald_test_terms', 'wresid']
Particularly important results are stored in summary(), params, pvalues, tvalues, and rsquared. We’ll cover all of these below. As the name suggests, the summary() attribute contains a summary of the regression results:
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.155
Model: OLS Adj. R-squared: 0.146
Method: Least Squares F-statistic: 17.26
Date: Wed, 22 Mar 2023 Prob (F-statistic): 7.18e-05
Time: 07:32:31 Log-Likelihood: 30.144
No. Observations: 96 AIC: -56.29
Df Residuals: 94 BIC: -51.16
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.0416 0.019 2.186 0.031 0.004 0.079
Mkt-RF 1.8297 0.440 4.154 0.000 0.955 2.704
==============================================================================
Omnibus: 29.750 Durbin-Watson: 1.561
Prob(Omnibus): 0.000 Jarque-Bera (JB): 56.488
Skew: 1.227 Prob(JB): 5.42e-13
Kurtosis: 5.846 Cond. No. 24.2
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Example 2: estimating a stock’s alpha and beta(s) using the Fama-French three-factor model
The Fama-French three factor model is a linear regression model that relates the excess return on a stock to the excess returns on the market portfolio and the returns on the SMB (small minus big) and HML (high minus low book-to-market) factors:
\[R_{i,t} - R_{f,t} = \alpha_i + \beta_{i,m} (R_{m,t} - R_{f,t}) + \beta_{i,smb} R_{smb,t} + \beta_{i,hml} R_{hml,t} + \epsilon_{i,t}\]
Challenge:
Estimate this regression for TSLA using the data we gathered at the top of these lecture notes.
# Run the regression and print the results
res3 = sm.OLS(endog = data['TSLA']-data['RF'],
              exog = data[['const','Mkt-RF','SMB','HML']],
              missing = 'drop'
              ).fit()
print(res3.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.164
Model: OLS Adj. R-squared: 0.137
Method: Least Squares F-statistic: 6.036
Date: Wed, 22 Mar 2023 Prob (F-statistic): 0.000847
Time: 07:32:31 Log-Likelihood: 30.677
No. Observations: 96 AIC: -53.35
Df Residuals: 92 BIC: -43.10
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.0371 0.020 1.872 0.064 -0.002 0.076
Mkt-RF 1.8810 0.480 3.915 0.000 0.927 2.835
SMB 0.1931 0.794 0.243 0.808 -1.384 1.770
HML -0.6717 0.667 -1.007 0.317 -1.997 0.653
==============================================================================
Omnibus: 31.346 Durbin-Watson: 1.587
Prob(Omnibus): 0.000 Jarque-Bera (JB): 62.467
Skew: 1.268 Prob(JB): 2.73e-14
Kurtosis: 6.030 Cond. No. 45.0
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Interpreting regression results
Coefficients
The “const” row contains information about the firm’s \(\alpha\) and the “Mkt-RF” row contains information about the firm’s \(\beta\). The \(\alpha\) and \(\beta\) coefficient estimates themselves are in the “coef” column (\(\alpha = 0.0416\), and \(\beta = 1.83\) in the single-factor model).
The “const” coefficient tells us what we should expect the return on TSLA to be in a month with no systematic shocks (i.e. a month with an excess market return of 0).
The ‘Mkt-RF’ coefficient tells us how we should expect the return on TSLA to react to a given shock to the market portfolio (e.g. the 1.83 coefficient tells us that, on average, if the market goes up by 1%, TSLA goes up by 1.83%, and when the market goes down by 1%, TSLA goes down by 1.83%)
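For example, plugging the estimates above into the fitted regression, the predicted excess return on TSLA in a month when the excess market return is 1% is roughly:

\[\hat{\alpha} + \hat{\beta} \times 0.01 = 0.0416 + 1.83 \times 0.01 \approx 0.060\]

i.e. about 6% for that month.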
The results object (res) stores the regression coefficients in its params attribute:
res.params
const 0.041642
Mkt-RF 1.829738
dtype: float64
Note that res.params is a Pandas Series, so we can access its individual elements using the index labels:
print('Alpha = ', res.params['const'])
print('Market Beta = ', res.params['Mkt-RF'])
Alpha = 0.041642456171143115
Market Beta = 1.8297380496913744
Challenge:
Print out (separately) the alpha and each of the betas from the three-factor model
print("three-factor alpha:", res3.params['const'])
print("three-factor market beta:", res3.params['Mkt-RF'])
print("SMB beta:", res3.params['SMB'])
print("HML beta:", res3.params['HML'])
three-factor alpha: 0.037089953287553816
three-factor market beta: 1.881000087482544
SMB beta: 0.19308158602997738
HML beta: -0.6717493189696122
Statistical significance
The p-values are in the “P > |t|” column. P-values lower than 0.05 allow us to conclude that the corresponding coefficient is statistically different from 0 at the 95% confidence level (i.e. reject the null hypothesis that the coefficient is 0). At the 99% confidence level, we would need the p-value to be smaller than 1% (1 minus the confidence level) to reject that null hypothesis.
The t-statistics for the two coefficients are in the “t” column. Loosely speaking, a t-statistic larger than 2 or smaller than -2 allows us to conclude that the corresponding coefficient is statistically different from 0 at the 95% confidence level (i.e. reject the null hypothesis that the coefficient is 0). In terms of statistical significance, the t-statistic does not provide any new information over the p-value.
The last two columns give us the 95% confidence interval for each coefficient.
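These intervals can also be retrieved directly from the results object with its conf_int() method (a short sketch; the two columns are the lower and upper bounds):

# 95% confidence intervals for the market-model coefficients
print(res.conf_int())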
For the market model, TSLA’s alpha has a p-value of 0.031, so we can conclude that its alpha is statistically significantly different from 0 at the 95% confidence level (but not at the 99% confidence level).
The fact that the alpha is positive and statistically different from 0 (at the 95% level) means that, based on the single-factor model, TSLA seems to be undervalued. A negative alpha would mean the stock is overvalued.
If we can not reject the null hypothesis that alpha is 0, the conclusion is NOT that alpha = 0 and therefore the stock is correctly valued (since we can never “accept” a null hypothesis, we can only fail to reject). The conclusion is that we do not have enough evidence to claim that the stock is either undervalued or overvalued (which is not the same thing as saying that we have enough evidence to claim that the stock is correctly valued).
The results object (res) stores the regression p-values in its pvalues attribute:
res.pvalues
const 0.031312
Mkt-RF 0.000072
dtype: float64
The p-values can be accessed individually:
print("Alpha p-value = ", res.pvalues['const'])
print("Beta p-value = ", res.pvalues['Mkt-RF'])
Alpha p-value = 0.031311923663591416
Beta p-value = 7.182372372221883e-05
T-statistics are stored in the tvalues attribute:
res.tvalues
const 2.185832
Mkt-RF 4.154374
dtype: float64
T-statistics of individual coefficients:
print("Alpha t-stat = ", res.tvalues['const'])
print("Beta t-stat = ", res.tvalues['Mkt-RF'])
Alpha t-stat = 2.1858319462947398
Beta t-stat = 4.154374466130978
Challenge:
Is TSLA mispriced (undervalued OR overvalued) at the 5% significance level with respect to the Fama-French 3-factor model?
print("TSLA mispriced? \n", res3.pvalues['const'] < 0.05)
TSLA mispriced?
False
Challenge:
Does TSLA have a significant exposure (at 5% level) to either of the 3 factors in the Fama-French model?
print('Significant market exposure?\n', res3.pvalues['Mkt-RF'] < 0.05)
print('Significant exposure to SMB?\n', res3.pvalues['SMB'] < 0.05)
print('Significant exposure to HML?\n', res3.pvalues['HML'] < 0.05)
Significant market exposure?
True
Significant exposure to SMB?
False
Significant exposure to HML?
False
The R-squared coefficient
The R-squared coefficient (top-right of the table, also referred to as the “coefficient of determination”) estimates the percentage of the total variance in the dependent variable (Y) that can be explained by the variance in the explanatory variable(s) (X).
In the context of our market-model example, the R-squared tells us the percentage of the firm’s total variance that is systematic in nature (i.e. non-diversifiable). The percentage of total variance that is idiosyncratic (diversifiable) equals 1 minus the R-squared.
The R-squared is stored in the rsquared attribute of the regression results object:
res.rsquared
0.1551232170825012
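This value can also be reproduced from the sums of squares stored in the results object (a sketch using the ssr and centered_tss attributes listed in dir(res) earlier):

# R-squared = 1 - (residual sum of squares) / (total centered sum of squares)
print(1 - res.ssr / res.centered_tss)   # should match res.rsquared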
Using the market model, our estimates of the percentage of total TSLA variance that is systematic vs idiosyncratic are:
print("Percent of total variance that is systematic: ", res.rsquared)
print("Percent of total variance that is idiosyncratic: ", 1 - res.rsquared)
Percent of total variance that is systematic: 0.1551232170825012
Percent of total variance that is idiosyncratic: 0.8448767829174988
Challenge:
What percentage of TSLA total variance can be diversified away under the Fama-French 3-factor model?
print("Percent of total variance that is systematic: ", res3.rsquared)
print("Percent of total variance that is idiosyncratic: ", 1 - res3.rsquared)
Percent of total variance that is systematic: 0.16445635788368973
Percent of total variance that is idiosyncratic: 0.8355436421163103
Diagnostics (bottom of the table):
The regression table reports (at the bottom) a few statistics that help us understand whether some of the assumptions of the linear regression model are not satisfied.
Durbin-Watson: tests for residual autocorrelation. Takes values in [0, 4]. Below 2 means positive autocorrelation; above 2 means negative autocorrelation. A value of 2 is ideal (no autocorrelation).
Omnibus: tests normality of residuals. Prob(Omnibus) close to 0 means reject normality
JB: another normality test (null is skew=0, kurt=3). Prob(JB) close to 0 means rejection of normality
Cond. No: tests for multicollinearity. Over 100 is worrisome, but we still have to look at correlations between variables to determine if any of them need to be dropped.
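If you need any of these diagnostics outside of the summary table, some of them can be recomputed directly from the residuals (a sketch using helper functions from statsmodels.stats.stattools):

# Recompute the Durbin-Watson and Jarque-Bera statistics from the residuals
from statsmodels.stats.stattools import durbin_watson, jarque_bera

print("Durbin-Watson:", durbin_watson(res.resid))
jb_stat, jb_pvalue, skew, kurt = jarque_bera(res.resid)
print("Jarque-Bera:", jb_stat, " p-value:", jb_pvalue)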